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Abstract 

The performance potential for simulating quantum electron transport on graphical processing units (GPUs) is studied. 
^^ Using graphene ribbons of realistic sizes as an example it is shown that GPUs provide significant speed-ups in comparison 
(*-K to central processing units as the transverse dimension of the ribbon grows. The recursive Green's function algorithm 
fv^ is employed and implementation details on GPUs are discussed. Calculated conductances were found to accumulate 

^ significant numerical error due to single-precision floating-point arithmetic at energies close to the charge neutrality 

F-{ point of the graphene. 
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During several decades the demands and importance of 
high-performance computation in physics have been steadily 
growing. Electron quantum transport simulations are one 
of the relevant fields. They are usually performed either 
for small device geometries or greatly reduced basis set of 
underlying atomic orbitals. Rough approximations, e.g., 
parabolic dispersion relations or plane wave bases, are fre- 
quently employed to compute the electrical conductances 
and currents. [1] However, only ab-initio computations are 
able to recover physical phenomena correctly and assess 
the validity of simpler approaches. Before the simulation 
even starts it is nessesarily to compromise between ac- 
curacy and computation speed. Such a choice is mainly 
determined by the available computer resources. At the 
same time experimental data is usually available for de- 
vices whose geometries compose thousands or even mil- 
lions of atoms. A prominent example of such devices 
is graphene nanoribbons studied experimentally by Lin 
etal.p] Their devices were 30 nm wide and 900 - 1700 nm 
long. The devices were fabricated from a graphene mono- 
layer that represents a honeycomb lattice of carbon atoms 
separated by a 0.142 nm distance. This immediately gives 
1-2 million carbon atoms that need to be taken into ac- 
count. To treat such large quantities advanced numerical 
methods should be employed, for example, the recursive 
Green's function formalism. [U [3] Even though such meth- 
ods are used one still faces a need for huge computing 
power. The study presented here explores the possibil- 
ity of using a Graphical Processing Unit (GPU) for ef- 
fective computing of quantum electron transport through 
graphene nanoribbons. 

Over the last few years it has been realized that the 
large computational power of GPU could be used for pur- 
poses other than the video game industry. This power 
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results from the relative simplicity of the GPUs architec- 
ture. Its performance exceeds that of Central Processing 
Unit (CPU) by large factors because of the large num- 
ber of parallel processing units on a single chip. By de- 
sign, GPUs are optimized for manipulating a large number 
of graphics primitives in parallel, which often amounts to 
simple, floating-point matrix calculations. In contrast to 
current CPUs, they are not designed to cope with "unex- 
pected" branches in the code, or for executing a single- 
threaded program as fast as possible. While this makes 
GPUs not well suited as drop-in replacements for CPUs, 
their highly parallel architecture might be advantageous 
for scientific calculations with a large part of paralleliz- 
able code. Their original design for graphics calculations 
however entails certain design features which are not neces- 
sarily optimal for scientific computational tasks, such as a 
special hierarchy of memory organization or a restriction to 
efficient floating-point calculations only in single precision 
arithmetic. GPU usage for scientific applications became 
much easier with advent of language extension NVIDIA 
CUDA. 4^ The results presented below were obtained us- 
ing a code written in the C language within the CUDA 
framework. 

The structures used for computer simulations are graphene 
nanoribbons. They are made from a single layer graphene 
tailored into narrow strips of width W , see Fig. 1(a). Nu- 
merical computations are explicitly performed for a cen- 
tral scattering region of length L. It is attached at its ends 
to two semi-infinite graphene regions of the same width. 
These semi-infinite regions serve as electron reservoirs sup- 
plying conducting electrons to the system. To simplify the 
computation each carbon atom is represented by a single p 
orbital. To a first approximation, this orbital determines 
the electronic properties of the graphene. The atomic con- 
figuration along the edges of the ribbon is armchair for all 
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Figure 1: (a) Graphene nanoribbon. Honeycomb lattice of carbon 
atoms forms a strip of width W and length L. Semi-infinite leads are 
attached on both ends of the ribbon, (b) Schematic of the device for 
application of Dyson's equation by splitting the device in two parts. 



of the following results. 

One of the most effective methods for quantum electron 
transport computation is the recursive Green's function 
formalism. [I] [3 [7] Its central quantity is the Green's 
function [1] 

g^iE-H)-' (1) 

where E is the electron energy and H is the Hamiltonian 
of the system. The latter includes the effects of atomic 
orbitals, leads, different lattice defects and others. Knowl- 
edge of the Green's function allows one to compute any 
desired quantity, e.g. charge density or conductance, that 
can be directly compared with experimentally measured 
quantities, ilj The conductance reads 



h 



(2) 



Here Tr represents the trace and the elements of the matrix 
rfi(L) couple the leads with the scattering region, see Ref. 
[1] for details. The retarded and advanced Green's func- 
tion g^ and t/" are computed by the recursive algorithm[5J 
[7] via successive application of the Dyson equation 



g = g" + g^'^g. 



(3) 



This equation relates the Green's function of the full sys- 
tem Z + Z' in terms of the subsystems Z , Z' and the cou- 
pling E between Z and Z\ see Figure 1(b). The algorithm 
starts by partitioning the scattering region into slices each 
accounting for the on-site interactions as well as interac- 
tions between elements within the given slice. Each slice is 
represented by a square matrix of size equal to the number 
of orbitals iV (or carbon atoms in the present study). The 
Green's functions at slice j, after simple algebra, can be 



Q\] = gi,j-i^'^gj]- 



^r'Q'v 



(4) 
(5) 



These equations are systematically applied from the first 
till the last, Af-th, slice of the scattering region; 5im ^ g^ ■ 
Each recursion loop involves at least 5 matrix multipli- 
cations along with one call of the linear equation solver. 
Computation time depends on both matrix dimension as 
well as the number of recursion loops, or, in other words, 
on the dimensions of the scattering region. Note that, at 
the 1-st and Af-th slices, one should also find and include 
the self-energies due to the leads which requires solution 
of an eigen-problem. The most time consuming part of 
the algorithm is recursion over the scattering region. One 
should therefore primarily target optimization of the re- 
cursion loops, Eqs. (4)-(5). 

To attain high performance of the recursive Green's 
function algorithm and use the whole power of the GPU 
one should perform recursive calculations on the GPU 
alone, without any data transfer to the global (host) com- 
puter memory and CPU calls. Matrix-matrix operations 
involved in each recursion loop can be effectively computed 
using the highly optimized CUBLAS[H] and CULA|g li- 
braries. They provide duplicates of the well-known BLAS 
and LAPACK routines ported to the GPU. In total, a sim- 
ulation proceeds in the following way: 

• Input data, define coupling self-energies, generate 
atomic defects and other scattering parameters if 
any. 

• Compute self-energies due to the leads. 

• Allocate GPU memory, transfer data from the host 
memory to the device memory. 

• Recursion loops: Build up the Hamiltonian for each 
slice and compute gjj, gij, Eqs. (4)- (5). 

• Transfer the computed Green's functions for the whole 
scattering region back to the host memory. Deallo- 
cate GPU memory. 

• Calculate the conductance, Eq. [2j output the results. 

Figure [2] shows computation times for the conductance 
in the graphene ribbons as a function of the ribbon width. 
The ribbon length is kept constant as L = 1700 nm that 
equals the device size experimentally studied in Ref. [2]. 
Note that the time needed for computation of the lead self- 
energies, the second step in above step flow, is excluded 
from the data in Fig. [2] because it is always performed on 
the CPU and takes less than 0.3% of total computation 
time. The solid line with open squares shows timings for 
GPU GeForce 9300, which is a low-cost hardware solution. 
The dotted line with filled squares corresponds to a single 
core of a 3.0 GHz Intel Core Quad along with unoptimized 
reference BLAS and LAPACK implementation. [TOl [JT] In 
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Figure 2; Computation times for the conductance of the graphene 
ribbon vs. the ribbon width or matrix size. The matrix size is equal 
to number of carbon atoms in the cross section. The length of the 
ribbon is fixed L = 1700 nm. The dotted line with filled squares 
shows CPU timings with reference (unoptimized) BLAS and LA- 
PACK libraries. The solid line with open squares corresponds to 
GPU GeForce 9300. The dash-dotted lines with filled and open 
circles show CPU timings with Intel MKL for single and multi- 
threaded modes, respectively. The multi-threaded mode is imple- 
mented within Open MP framework. CPU is 3.0 GHz Intel Core 
Quad. 



general, the CPU performs faster for narrow ribbons, W < 
10 nm, but becomes inferior to the GPU as the ribbon 
widens. This trend agrees with other resuhs reported in 
hterature.|12l For narrow ribbons, data transfer turns out 
to be a bottleneck though massive matrix operations be- 
come a limiting factor for wider devices. 

The GPU GeForce 9300 used in the present simula- 
tions has 2 multiprocessors each boarding 8 cores. The 
latter operate at 1.2 GHz. Because the GPU is used also 
for graphic needs the operating system allocates only a 
fraction of the resources for computations. As a result 
the GPU performance doesn't approach maximum values: 
it achieves roughly half of the theoretical processing rate. 
Nevertheless it is still twice as fast as a single core of the 
3.0 GHz Intel Gore Quad for wide graphene ribbons. Fig- 
ure [2] Using a dedicated general purpose GPU, like the 
recent Tesla G1060 card, will substantially improve the 
computation rate. 

The CPU performance might be improved by using 
multi-threaded libraries like the Intel Math Kernel Library 
(MKL). The latter includes BLAS and LAPACK routines 
optimized for Intel CPUs. The results for both single and 
multi-threaded MKL are shown in Figure [2] Computation 
time drops by several times in comparison to reference im- 
plementation of BLAS and LAPACK on Netlib. This gain 
is obtained by minor efforts. Another improvement might 
be achieved via MPICH parallelization that should theo- 
retically speed-up by a factor of ~4 for the Intel Core Quad 
when compared to single-core (or single-treaded) code im- 
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Figure 3: The conductance as a function of the Fermi energy for 
the graphene nanoribbon oi W = 30 nm and L = 1700 nm. The 
conductances calculated on the GPU deviate from the exact values 
obtained on the CPU that uses double-precision arithmetic. The 
GPU computation uses single-precision floating-point arithmetic. 



plementation. This however will require modification of 
the recursion algorithm, Eqs. (4)-(5). Substantial modi- 
fications of the recursion algorithm will also be needed to 
automate (non optimized) parallel core calculations by a 
compiler itself. 

An important issue of GPU computation is its accuracy 
since single-precision floating-point arithmetic is used to 
achieve the greatest efficiency. This leads to numerical er- 
rors systematically accumulated during recursion that ex- 
ceed the corresponding numerical error on the CPU where 
double-precision arithmetic is used. The latter actually 
might be considered as an exact result because CPU ac- 
curacy was found to be better than 10^^. Note that the 
CPU performance doesn't change with the precision of the 
arithmetic. Figure [3] compares the conductances for an 
ideal graphene ribbon of M^ = 30 nm and L = 1700 nm 
for the GPU and CPU architectures. In ideal case, they 
should reveal a step-like dependence with the conductance 
being a multiple of ^. However, the GPU conductances 
show strong irregular oscillations at low energies. These 
deviations are caused by (i) the numerical error incurred 
for the quantum-mechanical system with many evanescent 
states and (ii) particular structure of the unconfined wave 
function at the first electronic subband.TS] The numerical 
error might exceed 200% at low energies though it quickly 
decreases when conduction is dominated by three or more 
propagating states, i.e. G > 3 x ^. It turns out that 
the main source of the error is the linear equation solver 
(CGESV) in the current implementation of the recursive 
Green's function algorithm. As soon as CPUs will operate 
with double-presicion arithmetic, which has became sup- 
ported in a recent generation of GPU cards, the numerical 
error is expected to descrease and the computation results 
to match the precise results. It worth finally noting that 
double-precision computation performs slower than single- 



precision one. 

In conclusion, the present study showed that electron 
quantum transport simulations might be effectively done 
using the GPU architecture that outperforms the CPU 
even for a low-cost build-in hardware solution. Compu- 
tational speed-ups might be higher by large factors for 
"cost-comparable" hardware. The recursive Green's func- 
tion algorithm can be straightforwardly implemented in 
the CUDA framework. Highly optimized CUBLAS and 
CULA libraries allow one-to-one replacement of BLAS and 
LAPACK libraries. However, because the GPU architec- 
ture operates on single-precision floating-point data sub- 
stantial numerical error might accumulate. 

The author is grateful to G. Kirczenow for critical read- 
ing of the manuscript. 
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